Predicting the Injuriousness of Traffic Collisions Occurring in the City of Chicago: A Final Report
Data Science 2 with R (STAT 301-2)
Introduction
In this project, I will investigate the following predictive question: will any of the unfortunate individuals involved in a traffic collision — both motorists and non-motorists alike — emerge injured? This is a classification problem because it investigates a question of whether a traffic incident will be injurious instead of how many people are injured.
I want to investigate this predictive problem in particular because of the ubiquity of driving: more than 80 percent of U.S. adults are licensed drivers, with over 90 percent using motor vehicles to transport themselves to and from work (Hedges & Company, 2018; Dews, 2013). Driving is relatively cheap and time-efficient, providing individuals with a large amount of geographic freedom and autonomy. Yet, there are significant costs to driving: more cars on the road often mean a higher frequency of traffic collisions, many of which can result in debilitating injuries and even death. In fact, the U.S. experiences more motor-vehicle fatalities in both absolute and per-capita terms than any other high-income country (Yellman & Sauber-Schatz, 2022). In attempting to predict the “price tag” of driving (in the form of injurious collisions), I hope to minimize the heavy costs associated with an activity that has become so prevalent in, and important to, daily life.
To build my predictive modelling process, I will be analyzing collision-level crash data covering traffic incidents occurring within City of Chicago limits and under the jurisdiction of the Chicago Police Department since 2015¹. Approximately half of the observations are self-reported at the police district by the agent(s) involved; the other half are recorded by the responding police officer.
Data Overview
In its raw form, my dataset on Chicago traffic collisions describes over 800,000 crashes with 50 columns which — after basic cleaning (cleaning column names, factorizing relevant variables, collapsing factor levels) — consist of 27 factor variables, 17 integer variables, 5 string variables, and 1 logical variable. A dataset-wide missingness analysis reveals the following results.
Table 1 reinforces Figure 1 by showing that 11/50 variables in the data see a missingness rate of over 65%, with the top 8 seeing missing values for over 90% of the entire dataset — a magnitude of missingness that is certainly concerning, especially since it limits my flexibility for feature engineering/selection. Fortunately, the rate of missingness drops precipitously for the remaining 39 variables, which see either zero or close-to-zero missingness.
To construct my outcome variable of interest, I derive it using if_else() from one of the preexisting variables, injuries_total, which is an integer count of the number of individuals sustaining fatal, incapacitating, non-incapacitating, or possible injuries in a given traffic collision. The resulting binary outcome variable — injurious — is appended to the dataset as a new column and is coded as:
• Yes if injuries_total exceeds zero;
• No if injuries_total equals zero; and
• NA if injuries_total is a missing value
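As a minimal sketch of this coding rule (assuming the raw data frame is named `crashes`, an illustrative name):

```r
library(dplyr)

# Hypothetical object name `crashes`; injuries_total is the raw integer column
crashes <- crashes %>%
  mutate(
    injurious = factor(
      if_else(injuries_total > 0, "Yes", "No"),
      levels = c("Yes", "No")
    )  # if_else() propagates an NA in injuries_total to an NA in injurious
  )
```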
I have chosen to take the “classification route” with the new injurious variable — as opposed to the “regression route” with the existing injuries_total variable — so as to preserve the injury-focused nature of my initial question while, at the same time, streamlining the ensuing data downsizing/balancing process².
2 Described in the following section
The following exhibits will explore the missingness and distribution of my injurious outcome variable using the original data.
injurious
| variable | n_miss | pct_miss |
|---|---|---|
| injurious | 1757 | 0.2187653 |
The outcome variable injurious fortunately sees a missingness rate of only 0.22% (Table 2), which means that the process of “throwing out” missing values during recipe building should not generate significant disruptions/bias. Next, I will explore the distribution of injurious across the raw dataset after excluding the 1,757 (out of 803,144) observations that lack data on injuriousness.
injurious, visualized
injurious, summarized
| injurious | n |
|---|---|
| Yes | 110241 |
| No | 691146 |
Figure 2 and Table 3 reveal a large amount of class imbalance in injurious: non-injurious collisions significantly outnumber injurious collisions, with the ratio between the two classes exceeding 6:1. This warrants 2 additional steps that will be taken — dataset downsizing and stratified random sampling — which will be described in further detail in the next section.
Specifically, this next section will explore in detail my steps of data splitting, model building/tuning, and recipe engineering.
Methods
Dataset Downsampling/Downsizing
Prior to splitting my dataset, I first run it through 2 prerequisite steps: downsampling and downsizing. The sheer size of my dataset in its raw form (800,000+ rows) is unsustainable given my computational and temporal constraints; additionally, the class imbalance within my binomial outcome variable, injurious, warrants adjustment. I address both of these concerns in tandem through these 2 steps:
- Downsampling: After throwing out missing data (~0.22% of rows), I use `slice_sample()` to randomly downsample my dataset with respect to the underrepresented class in `injurious`, “Yes”. This reduces the number of observations from over 800,000 to about 220,000³ and at the same time ensures a 1:1 class balance in my dummy outcome variable.
- Additional downsizing: A 220,000-row dataset still exceeds my computational limits; as such, I further downsize my sample via random selection using `initial_split()`. The result is a ~44,000-row final dataset, exactly 20% as large as the post-downsampling dataset and roughly 5-6% as large as the original dataset.
3 The number of observations in the underrepresented outcome variable class — collisions that are injurious — is about 110,000 in the raw dataset
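As a sketch, the two steps above might be implemented as follows (the object names and seed are assumptions, not the report's actual code):

```r
library(dplyr)
library(rsample)

set.seed(301)  # arbitrary seed, an assumption

# Drop rows with a missing outcome, then find the minority-class size (~110,000)
complete   <- filter(crashes, !is.na(injurious))
n_minority <- sum(complete$injurious == "Yes")

# Downsampling: draw n_minority rows from each class for a 1:1 balance (~220,000 rows)
balanced <- complete %>%
  group_by(injurious) %>%
  slice_sample(n = n_minority) %>%
  ungroup()

# Additional downsizing: keep a random 20% (~44,000 rows) via initial_split()
downsized <- training(initial_split(balanced, prop = 0.20))
```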
Data Splitting/Resampling
Next, I use an 80:20 proportion combined with stratified random sampling (with respect to injurious) in initial_split() in order to split this final dataset into training and testing sets. Within the training set of roughly 35,000 rows, I then use V-fold cross-validation to generate resampled data on which I later conduct my model competition. Specifically, my resampling process entails randomly partitioning my training set into 5 folds (v = 5), repeated 3 times (repeats = 3), generating 15 resamples — each held-out fold containing roughly 7,000 observations⁴ — on which my various models are trained and evaluated.
4 Since v = 5, each resample uses 4 of the 5 folds (~28,000 rows, or 80% of the training set) for model fitting and the remaining held-out fold (~7,000 rows, or 20%) for assessment
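The splitting and resampling steps might be sketched as follows (assuming the downsized dataset is named `downsized`; names and seed are illustrative):

```r
library(rsample)

set.seed(301)  # arbitrary seed, an assumption

# 80:20 train/test split, stratified by the outcome
crash_split <- initial_split(downsized, prop = 0.80, strata = injurious)
crash_train <- training(crash_split)  # ~35,000 rows
crash_test  <- testing(crash_split)   # ~9,000 rows

# 5 folds x 3 repeats = 15 resamples, again stratified by the outcome
crash_folds <- vfold_cv(crash_train, v = 5, repeats = 3, strata = injurious)
```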
| Partition | n |
|---|---|
| Training | 35276 |
| Testing | 8820 |
| Total | 44096 |

Table 4 reveals that the 80:20-proportion split has been successfully implemented.
Model Building/Tuning
I define and, using a regular grid, tune the following 7 models⁵ — the first 2 being baseline models — for use in my model competition:
5 To accommodate the binomial nature of injurious, all 7 models use mode = “classification”
- Null: A simple baseline null model defined using `null_model()` with the `parsnip` engine.
- Naive Bayes: A simple “step-up” baseline model defined using `naive_Bayes()` with the `klaR` engine.
- Logistic regression: A parametric non-regularized regression model defined using `logistic_reg()` with the `glm` engine.
- Elastic net: A parametric regularized regression model defined using `logistic_reg()` with the `glmnet` engine and the following tuning parameters:
  - Mixture explored over [0, 1] with 10 levels
  - Penalty explored over [-3, 0] with 10 levels
- K-nearest neighbors: A non-parametric algorithm defined using `nearest_neighbor()` with the `kknn` engine and the following tuning parameter:
  - Neighbors explored over [1, 15] with 5 levels
- Random forest: A non-parametric, independently-trained algorithm defined using `rand_forest()` with the `ranger` engine, 500 trees, and the following tuning parameters:
  - Number of predictors randomly sampled at each split (`mtry`) explored over [1, 5] with 4 levels
  - Minimum number of node data points required for further splitting (`min_n`) explored over [2, 40] with 4 levels
- Boosted tree: A non-parametric, sequentially-trained algorithm defined using `boost_tree()` with the `xgboost` engine, 500 trees, and the following tuning parameters:
  - Number of predictors randomly sampled at each split (`mtry`) explored over [1, 5] with 4 levels
  - Minimum number of node data points required for further splitting (`min_n`) explored over [2, 40] with 4 levels
  - Learning rate (`learn_rate`) explored over [-5, -0.2] on a log-10 scale with 4 levels
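As a sketch of the most heavily tuned of these, the boosted tree specification and its regular grid might look like the following (object names are assumptions; the ranges and levels come from the list above):

```r
library(parsnip)
library(dials)
library(tune)

# Boosted tree spec: 500 trees, three tuned hyperparameters
boost_spec <- boost_tree(
  mtry = tune(), min_n = tune(), learn_rate = tune(), trees = 500
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Regular grid: 4 levels per parameter -> 4 x 4 x 4 = 64 candidates
boost_grid <- grid_regular(
  mtry(range = c(1, 5)),
  min_n(range = c(2, 40)),
  learn_rate(range = c(-5, -0.2)),  # log-10 scale by default in dials
  levels = 4
)
```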
Recipe Engineering
Next, the recipes I use in conjunction with my 7 models are constructed along 2 independent dimensions:
- “Kitchen sink” vs. “feature engineered”: These differ on the basis of feature selection; my kitchen sink recipe uses as many predictors as possible, while my feature engineered recipe is more selective with the predictors used.
  - The kitchen sink feature selection only filters out 26 “unacceptable” predictors⁶, which are variables that:
    - see missingness rates in excess of 90% (e.g. `workers_present_i`);
    - are too closely correlated with `injurious` (e.g. `injuries_total`); or
    - contain too many factor levels, often because they serve as identifiers (e.g. `crash_record_id`).
  - The feature engineered selection, on the other hand, actively includes 11 predictors⁷ — `alignment`, `posted_speed_limit`, `lane_cnt`, `intersection_related_i`, `trafficway_type`, `device_condition`, `report_type`, `first_crash_type`, `num_units`, `lighting_condition`, and `month` — which I select on the basis of having observable/notable bivariate relationships with `injurious`⁸.
- Parametric vs. non-parametric: Recipes that are compatible with my 3 parametric models (null, logistic regression, and elastic net) differ from recipes meant for my 3 non-parametric models (nearest neighbors, random forest, and boosted tree) in 2 ways:
  - Unlike their parametric counterparts, my non-parametric recipes use one-hot encoding when converting factor variables into numeric terms.
  - Unlike their non-parametric counterparts, my parametric recipes incorporate interaction terms between 5 predictors⁹: `lighting_condition`, `num_units`, `trafficway_type`, `intersection_related_i`, and `alignment`.
6 View Appendix: Technical Info to see which 26 variables are filtered out
7 Refer to README.md within the data/ subdirectory for variable definitions
8 Refer to Appendix: EDA for univariate/bivariate/multivariate analyses of predictors in relation to my predictive problem
9 This comparison only applies for the feature engineered parametric/non-parametric recipes; the kitchen sink recipe is intentionally kept simple in its omission of interaction terms
These 2 dimensions alone generate 4 possible recipe combinations: parametric + kitchen sink, parametric + feature engineered, non-parametric + kitchen sink, and non-parametric + feature engineered. Beyond their differences, all 4 recipes share the same basic pre-processing steps of a) imputing missing predictor values using nearest neighbors; b) dummy-encoding all factor predictors; c) removing predictor variables with zero variance; and d) centering/scaling all numeric predictors.
Importantly, I include an additional recipe designed exclusively for the naive Bayes model; this recipe is identical to the parametric + kitchen sink recipe, except it omits the pre-processing step of dummy-encoding factor variables. I therefore end up with 5 total recipes: the 4 combinations detailed above and the naive-Bayes-specific specification.
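As a sketch, the non-parametric kitchen sink recipe with these shared steps might look like the following (object names are assumptions, and the `step_rm()` call shows only 3 of the 26 excluded variables named earlier):

```r
library(recipes)

# Hypothetical non-parametric kitchen sink recipe; crash_train assumed to exist
recipe_sink_tree <- recipe(injurious ~ ., data = crash_train) %>%
  step_rm(injuries_total, crash_record_id, workers_present_i) %>%  # 3 of 26 exclusions
  step_impute_knn(all_predictors()) %>%                  # a) KNN imputation
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%  # b) one-hot dummies
  step_zv(all_predictors()) %>%                          # c) drop zero-variance columns
  step_normalize(all_numeric_predictors())               # d) center and scale
```

The parametric variants would set `one_hot = FALSE` and add `step_interact()` terms; the naive Bayes recipe would simply drop the `step_dummy()` line.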
Assessment Metric
So that I can systematically compare the predictive performances of my various models and their recipe specifications, I will use the accuracy metric — the proportion of observations predicted correctly by a given model — for its easy interpretability as well as its compatibility with the binomial nature of my injurious outcome variable.
Model Building & Selection Results
Model Candidates
In the ensuing model competition process, my 2 baseline models (null and naive Bayes) are individually matched with 1 recipe¹⁰, while the other 5 more-complex models are each matched with 2 recipes. The recipe-by-recipe breakdown is as follows:
10 The naive Bayes model uses its designated recipe specification while the null model uses the parametric + kitchen sink recipe specification
- Naive Bayes recipe only: matched only to the naive Bayes model;
- Parametric + kitchen sink recipe only: matched only to the null model;
- Parametric + kitchen sink recipe AND parametric + feature engineered recipe: matched to the logistic regression and elastic net models;
- Non-parametric + kitchen sink recipe AND non-parametric + feature engineered recipe: matched to the nearest neighbors, random forest, and boosted tree models
Fitted Results: General Takeaways
The following exhibit displays the accuracy results of the best-performing candidates for each model/recipe combination¹¹, fitted and then averaged across the 15 resamples/folds:
11 For context, complex models that are fit using kitchen sink recipes are denoted with “1”, and complex models that are fit using feature engineered recipes are denoted with “2”
accuracy metric
Below is the same table, but rearranged such that the best-performing models are at the top:
accuracy metric, arranged in descending order
There are 3 general takeaways from Table 5 and Table 6:
- The top-performing individual model per the `accuracy` metric is the boosted tree model fit using the kitchen sink recipe.
- Across all complex model types, the kitchen sink recipe strictly dominates the feature engineered recipe when it comes to predictive accuracy.
  - This suggests that, in my case, feature engineering is not worth it: it adds additional effort and yields worse results.
  - This does not necessarily suggest, however, that a kitchen sink strategy for feature selection is in general strictly superior to a more selective one; it is likely that I simply omitted key variables from the construction of my feature engineered recipe.
- Among the top-performing complex models, there appear to be close-to-zero differences in predictive performance.
  - For instance, the difference in mean `accuracy` between my top performer (boosted1) and the runner-up (rf1) is only 0.0003496 — merely a 0.0457% relative difference in performance.
  - Nonetheless, the 5 non-baseline models do perform with meaningfully greater predictive accuracy than the baseline `null` and `nbayes` models.
The Top-Performing Model Candidate
Table 6 reveals that, holding the recipe constant, there are extremely small accuracy differences between the boosted tree, random forest, logistic regression, and elastic net models — to the point that it becomes difficult to definitively settle on a “best” model candidate. For the purpose of this project, however, I will choose boosted1 — the boosted tree model fit using the kitchen sink recipe — as my final model candidate, for 2 overarching reasons:
- Even though the random forest model outperforms the boosted tree model on the feature engineered recipe (compare rf2 to boosted2), it has already been established that the kitchen sink recipe is strictly dominant regardless of model type.
  - To this point, the best-performing candidate on the kitchen sink recipe is the boosted tree model.
- Since the difference in predictive performance between the top candidates is near-zero, the “best” option for me is still to default to the top-performing candidate — regardless of the gap in performance — unless I have a compelling external reason not to do so.
  - In this case, the fact that the performance gaps between candidates remain consistently small across both recipes tells me that there is no compelling reason to actively avoid the boosted tree model.
More specifically, the top-performing model (boosted1) on average correctly predicts the injury status of approximately 76.60% of collisions across the 15 resamples/folds. Note, however, that this winning candidate represents only 1 of the 64 boosted tree models created via tuning combinations during the model-building process. The following exhibits therefore explore in further detail the particular tuning parameters of the winning boosted1 model:
Figure 3 reveals a few notable findings regarding boosted tree tuning:
- Top-performing boosted tree models per the `accuracy` metric appear to cluster around a particular point in the bottom-right graph (`learn_rate` = -1.8 on the log-10 scale).
  - At this point, `mtry` equals its maximum of 5 while `min_n` equals its minimum of 2.
- A higher `mtry` value over the range [1, 5] generally corresponds to greater model performance, but this cannot be extrapolated across all cases.
  - Notably, this trend reverses at the highest possible `learn_rate` over [-5, -0.2], namely -0.2 (log-10 scale).
- A lower `min_n` value over the range [2, 40] generally corresponds to greater model performance, but this cannot be extrapolated across all cases.
  - In parallel to the finding above, this trend reverses at the highest possible `learn_rate` of -0.2 (log-10 scale).
The end result from Figure 3 — the single best-performing boosted tree model — can be summarized in the following table:
boosted1
| mtry | min_n | learn_rate | .config |
|---|---|---|---|
| 5 | 2 | 0.0158489 | Preprocessor1_Model36 |
The following statement can be drawn from Table 7: conditional on a learn_rate of -1.8 on the log-10 scale (i.e., 10^-1.8 ≈ 0.0158), my boosted tree model performs with the greatest predictive accuracy using the smallest possible min_n (2) and the largest possible mtry (5). The most puzzling conclusion from this tuning analysis is that there is no clear-cut answer regarding the optimal learn_rate: it does not follow a predictable ceteris-paribus pattern with respect to accuracy, which is particularly concerning because the value of learn_rate also appears to determine the optimal values of min_n and mtry. As such, a case can be made for further tuning of the learn_rate parameter; a more optimal value may be uncovered via, for example, an exploration across more levels.
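One such follow-up might, as a sketch, sweep `learn_rate` alone over a finer regular grid around the winning region (the range and level count here are assumptions, not results from the report):

```r
library(dials)

# 20 log-10-scale levels around the winning region, instead of the original 4
finer_grid <- grid_regular(
  learn_rate(range = c(-3, -0.5)),
  levels = 20
)
```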
Additional Tuning Analysis: Other Candidates
The following exhibit explores the optimal tuning parameter values for the best-performing representatives of the 3 other model types marked for tuning: random forest, K-nearest neighbors, and elastic net.
Table 8 reveals that the optimal tuning hyperparameters for each model are generally consistent across the kitchen sink and feature engineered recipes; the only exception to this is the elastic net model, whose optimal hyperparameter value for penalty slightly differs between the two. As for the remaining 2 models:
- K-nearest neighbors model: Its predictive performance is optimized with the largest possible number of `neighbors` (15).
- Random forest model: Its predictive performance is optimized with the largest `mtry` value (5) and a moderate `min_n` value (27).
Building More Complex Models: Is It Worth It?
Perhaps one of the most important queries to investigate in the context of model competition/selection is whether it is truly worth it to build more complex models which, in my case, are the logistic regression, elastic net, K-nearest neighbors, random forest, and boosted tree models. Given my results from Table 6, I draw the following 2 conclusions:
- At a fundamental level, it appears to be certainly worth it to build more complex models beyond the baselines.
  - Relative to the null baseline model, the top 7 competitors see accuracy rates at least 25 percentage points (or 50% in relative terms) higher.
  - Relative to the naive Bayes baseline model, the same top 7 competitors see accuracy rates around 10-11 percentage points (or 15-17%) higher.
- There is, however, a catch: conditional on having more-complex models, it does not appear to be worth it to introduce even more complexity.
  - For instance, `boosted2` performs only slightly better than `en2` using the same underlying recipe.
Final Model Analysis
The final component of my predictive modelling process entails fitting my best-performing model workflow — boosted1 — to my testing set of roughly 9,000 observations. Prior to this, I take 2 prerequisite steps:
- I extract the underlying feature engineering specifications (i.e. the non-parametric kitchen sink recipe) as well as the optimal tuning hyperparameters (i.e. `mtry` = 5, `min_n` = 2, and `learn_rate` = -1.8 on the log-10 scale, or ≈ 0.0158) of my winning model, `boosted1`.
- I then train the extracted model workflow on my whole ~35,000-row training dataset and save the result as a fitted model object.
Finally, I apply the resulting fitted model to my testing set of ~9,000 rows — all in order to see how this winning model performs on never-before-seen-data. The following exhibits display my results.
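As a sketch, these finalize-and-fit steps might look like the following (the tuning-result, workflow, and split object names are assumptions):

```r
library(tune)
library(workflows)

# Pull the winning hyperparameters from the tuning results
best_params <- select_best(boost_tune_results, metric = "accuracy")

# Lock them into the workflow, then train on the full training set
# and evaluate once on the held-out testing set
final_wf  <- finalize_workflow(boost_wf, best_params)
final_res <- last_fit(final_wf, split = crash_split)

collect_metrics(final_res)      # test-set accuracy
collect_predictions(final_res)  # per-collision predicted classes and probabilities
```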
Table 9 provides a side-by-side comparison between the actual and predicted injurious values of the 8,820 traffic collisions within my testing set, as well as the class probabilities assigned to each of the 2 levels — Yes and No — per collision.
boosted1 on testing data using accuracy
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.764966 |
Using the accuracy metric, Table 10 reveals that the winning boosted tree model correctly predicts the injury status of a given collision 76.497% of the time within the testing set.
Now, let’s further decompose this predictive assessment using a confusion matrix.
Confusion matrix of injurious class values
Figure 4 confirms our finding from Table 10: (3361 + 3386) / (3361 + 3386 + 1024 + 1049) = 6747 / 8820 = 76.497% of testing-set traffic collisions are correctly predicted. Notably:
- 3,386 predictions are true positives — injurious collisions that are correctly predicted to be injurious by the model
- 3,361 predictions are true negatives — non-injurious collisions that are correctly predicted to be non-injurious by the model
- 1,049 predictions are false positives — collisions that are predicted to be injurious but are actually non-injurious
- 1,024 predictions are false negatives — collisions that are predicted to be non-injurious but are actually injurious
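The accuracy arithmetic above can be checked directly from the four cell counts:

```r
# Confusion-matrix cell counts from Figure 4
tp <- 3386  # true positives
tn <- 3361  # true negatives
fp <- 1049  # false positives
fn <- 1024  # false negatives

accuracy <- (tp + tn) / (tp + tn + fp + fn)
round(accuracy * 100, 3)  # 76.497
```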
Overall, a final accuracy metric of 76.497% on the assessment data is a slight decline from 76.60% — the final model’s accuracy on the resampled data — a 0.103-percentage-point decrease. This is to be expected, since boosted1, which was tuned and selected on the resampled training folds, is now evaluated on never-before-seen data.
Overall, I believe that this is a solid performance: even on never-before-seen data, this final model’s accuracy metric is a) over 25 percentage points (50%) more accurate than the null model and b) over 11 percentage points (17%) more accurate than the naive Bayes model, both of which are assessed across the 15 resamples. While this is not a magnificent performance, I believe it does justify building more complex models beyond just the baselines.
Still, this model is far from perfect; analyses from previous reports identify 3 key caveats to the efficacy of this final model:
- The underlying kitchen sink recipe matched with this model is not necessarily optimal.
  - Its superior performance against my feature engineered recipe may point towards a flaw in my feature engineering — not necessarily the dominance of a kitchen sink selection strategy.
- The “objectively optimal” hyperparameter values for `min_n`, `mtry`, and `learn_rate` are still not exactly known.
  - This analysis merely identified the best-performing tuning set among 64 possible boosted-tree combinations, but more combinations exist and should be explored if computational constraints allow.
- Although `boosted1` is the top performer among candidates in Table 6, the differences across the upper 50% of candidates are marginal at best.
  - If this model is “any good”, then it is possible that the runner-up models are as well.
Conclusion
In summary, this report details my journey of predicting traffic collision injury outcomes in Chicago. Utilizing a robust dataset and strategic downsampling/downsizing, the boosted tree model, paired with the “kitchen sink” recipe, emerges as the top performer with an average accuracy rate of 76.60% across resamples. The analysis delves into the nuanced relationships among hyperparameters, emphasizing the importance of tuning, especially for the learn_rate parameter. The final model, when evaluated on a separate testing set, maintains a solid accuracy rate of 76.497%, surpassing the null and naive Bayes baseline models. This report concludes by advocating for continuous refinement and exploration to enhance predictive modelling robustness — a practice that is especially important when it comes to devising targeted traffic interventions that can potentially save thousands of lives.
References
How many licensed drivers are there in the US? (2018). Hedges & Company. https://hedgescompany.com/blog/2018/10/number-of-licensed-drivers-usa/#:~:text=Across%20all%20age%20groups%2C%2084.1,population%20has%20a%20driver’s%20license
Yellman, M. A. & Sauber-Schatz, E. K. (2022). Motor Vehicle Crash Deaths — United States and 28 Other High-Income Countries, 2015 and 2019. Morbidity and Mortality Weekly Report (MMWR), 71(26), 837-843. https://www.cdc.gov/mmwr/volumes/71/wr/mm7126a1.htm?s_cid=mm7126a1_w#suggestedcitation
Appendix: EDA
In this section, I detail the bivariate and multivariate EDA exhibits used to justify my predictor and interaction term selections for use in my feature engineered recipe. I also detail the univariate explorations conducted on factor-based features to explore potential class imbalances.
Recall that the feature selection/engineering steps unique to recipe1_parametric and recipe1_tree consist of the following components:
- 11 predictors/“features”¹²: `alignment`, `posted_speed_limit`, `lane_cnt`, `intersection_related_i`, `trafficway_type`, `device_condition`, `report_type`, `first_crash_type`, `num_units`, `lighting_condition`, and `month`
- 4 interaction terms created using 5 features — `lighting_condition`, `num_units`, `trafficway_type`, `intersection_related_i`, and `alignment` — and defined as the following:
  - An interaction term between `num_units` and `lighting_condition`
  - An interaction term between `num_units` and `trafficway_type`
  - An interaction term between `alignment` and `lighting_condition`
  - An interaction term between `alignment` and `intersection_related_i`

12 Refer to README.md within the data/ subdirectory for variable definitions
The following 3 subsections are dedicated to bivariate, multivariate, and univariate EDA exhibits, respectively.
Bivariate EDA: Feature Selection
1. Street Alignment
alignment: “Street alignment at crash location, as determined by reporting officer.”
A higher proportion of collisions occurring on streets with curved alignments are injurious.
2. Posted Speed Limit
posted_speed_limit: “Posted speed limit, as determined by reporting officer.”
| injurious | Mean speed limit |
|---|---|
| Yes | 29.65187 |
| No | 28.23470 |
On average, injurious collisions occur on roadways with slightly higher posted speed limits.
3. Lane Count
lane_cnt: “Total number of through lanes in either direction, excluding turn lanes, as determined by reporting officer (0 = intersection).”
| injurious | Mean lane count |
|---|---|
| Yes | 2.746064 |
| No | 2.585954 |
On average, injurious collisions occur on roadways with slightly more lanes.
5. Trafficway Type
trafficway_type: “Trafficway type, as determined by reporting officer.”
Collisions occurring at four-way intersections tend to be the most injurious; collisions occurring in parking lots tend to be the least injurious.
6. First Crash Type
first_crash_type: “Type of first collision in crash.”
Injurious collisions are more likely to involve non-motorists.
7. Collision Report Type
report_type: “Administrative report type (at scene, at desk, amended).”
Injurious collisions are more likely to be reported on-scene; non-injurious collisions are more likely to be reported at-desk.
8. Number of Units Involved
num_units: “Number of units involved in the crash. A unit can be a motor vehicle, a pedestrian, a bicyclist, or another non-passenger roadway user. Each unit represents a mode of traffic with an independent trajectory.”
| injurious | Mean number of units involved |
|---|---|
| Yes | 2.130747 |
| No | 2.022371 |
On average, injurious collisions tend to involve slightly more units.
9. Traffic Device Condition
device_condition: “Condition of traffic control device, as determined by reporting officer.”
Strangely, injurious collisions are less likely to occur near poorly functioning traffic control devices. I would expect the opposite to be true.
10. Month of Collision
month: “The month component of [the collision’s date of occurrence].”
| injurious | Average collision month |
|---|---|
| Yes | 6.874865 |
| No | 6.692368 |
Injurious collisions tend to occur slightly later in a given year.
11. Lighting Conditions
lighting_condition: “Light condition at time of crash, as determined by reporting officer.”
Relative to their non-injurious counterparts, injurious collisions are slightly more likely to occur under non-daylight lighting conditions.
Multivariate EDA: Interaction Term Selection
Here, I will display the exploratory analyses conducted to discover and justify my 4 interaction terms.
1) num_units * lighting_condition
Does the effect of collision scope on injuriousness potentially depend on external lighting conditions?
The positive association between incident scope (num_units) and injury status (injurious) appears to be much stronger for collisions occurring under daylight.
2) num_units * trafficway_type
Does the effect of collision scope on injuriousness potentially depend on external trafficway conditions?
A distinct linear relationship between collision scope (num_units) and injury status (injurious) applies only to collisions for which trafficway_type = “FOUR WAY”; for the other trafficway categories, the relationship is “U-shaped” and/or unclear.
3) alignment * lighting_condition
Does the effect of street alignment on injuriousness potentially depend on external lighting conditions?
The extent to which curved streets (alignment = “CURVE”) contribute to higher injuriousness (injurious = “Yes”) depends slightly on external lighting conditions.
Univariate EDA: Predictor Class Imbalances
8 of my 11 predictors used in the feature engineered recipes are categorical/factorial/nominal. As such, I dedicate the following univariate exhibits to exploring the class imbalances among them.
1. Street Alignment
| alignment | n |
|---|---|
| CURVE ON GRADE | 321 |
| CURVE ON HILLCREST | 105 |
| CURVE, LEVEL | 1468 |
| STRAIGHT AND LEVEL | 171535 |
| STRAIGHT ON GRADE | 2403 |
| STRAIGHT ON HILLCREST | 554 |
3. Trafficway Type
| trafficway_type | n |
|---|---|
| DIVIDED | 41973 |
| FOUR WAY | 16156 |
| NOT DIVIDED | 75269 |
| ONE-WAY | 18365 |
| PARKING LOT | 8350 |
| Other | 16273 |
4. First Crash Type
| first_crash_type | n |
|---|---|
| Motorist | 144301 |
| Animal | 106 |
| Object | 10825 |
| Other | 542 |
| Non-motorist | 20612 |
5. Collision Report Type
| report_type | n |
|---|---|
| AMENDED | 49 |
| NOT ON SCENE (DESK REPORT) | 70116 |
| ON SCENE | 99048 |
| NA | 7173 |
6. Traffic Device Condition
| device_condition | n |
|---|---|
| Bad | 92468 |
| Good | 71516 |
| Unknown | 12402 |
7. Lighting Condition
| lighting_condition | n |
|---|---|
| DARKNESS | 8121 |
| DARKNESS, LIGHTED ROAD | 43181 |
| DAWN | 3098 |
| DAYLIGHT | 111214 |
| DUSK | 5255 |
| UNKNOWN | 5517 |
8. Month of Collision
| month | n |
|---|---|
| 1 | 14000 |
| 2 | 11776 |
| 3 | 12528 |
| 4 | 12431 |
| 5 | 14760 |
| 6 | 15256 |
| 7 | 15697 |
| 8 | 16093 |
| 9 | 16334 |
| 10 | 17069 |
| 11 | 15284 |
| 12 | 15158 |